Lab 0

EXAMPLE

These data appeared in the Wall Street Journal. The advertisements were selected in an annual survey conducted by Video Board Tests, Inc., a New York ad-testing company, based on interviews with 20,000 adults who were asked to name the most outstanding TV commercial they had seen, noticed, and liked. The retained impressions were based on a survey of 4,000 adults, in which regular product users were asked to cite a commercial they had seen for that product category in the past week. There are 21 observations on three variables.

Variables:
Firm – the name of the firm
Spend – TV advertising budget, 1983 ($ millions)
MilImp – millions of retained impressions per week

These data are saved in a text file (.txt) called TVads, which will be imported into R as shown in the following code. The path to the location where the text file is saved must be changed accordingly.

TVads <- read.table("TVads.txt", header= TRUE) 
head(TVads) # This will print only the first 6 rows of data file 
##              Firm Spend MilImp
## 1     MILLER.LITE  50.1   32.1
## 2           PEPSI  74.1   99.6
## 3         STROH'S  19.3   11.7
## 4 FEDERAL.EXPRESS  22.9   21.9
## 5     BURGER.KING  82.4   60.8
## 6       COCO-COLA  40.1   78.6
TVads # This will print the complete data file in the Console Window 
##               Firm Spend MilImp
## 1      MILLER.LITE  50.1   32.1
## 2            PEPSI  74.1   99.6
## 3          STROH'S  19.3   11.7
## 4  FEDERAL.EXPRESS  22.9   21.9
## 5      BURGER.KING  82.4   60.8
## 6        COCO-COLA  40.1   78.6
## 7      MC.DONALD'S 185.9   92.4
## 8              MCI  26.9   50.7
## 9        DIET.COLA  20.4   21.4
## 10            FORD 166.2   40.1
## 11          LEVI'S  27.0   40.8
## 12        BUD.LITE  45.6   10.4
## 13        ATT.BELL 154.9   88.9
## 14    CALVIN.KLEIN   5.0   12.0
## 15         WENDY'S  49.7   29.2
## 16        POLAROID  26.9   38.0
## 17          SHASTA   5.7   10.0
## 18        MEOW.MIX   7.6   12.3
## 19     OSCAR.MEYER   9.2   23.4
## 20           CREST  32.4   71.1
## 21  KIBBLES.N.BITS   6.1    4.4
# The following plot function creates the scatterplot shown in Figure 1 below
plot(TVads$Spend, TVads$MilImp, pch=16, xlab = "TV advertising budget, 1983 ($ millions) ", ylab = "Millions of retained impressions per week", main="Figure 1: Scatterplot of Millions of retained impressions per week 
     and TV advertising budget", cex.main=0.9)

Simple linear regression analysis in R

model <- lm(MilImp ~ Spend, data = TVads) # The lm function is used to fit simple linear regression model
model# This displays only the values of estimated regression coefficients
## 
## Call:
## lm(formula = MilImp ~ Spend, data = TVads)
## 
## Coefficients:
## (Intercept)        Spend  
##     22.1627       0.3632
summary(model) # The summary function is used to get more information about the fitted regression model #
## 
## Call:
## lm(formula = MilImp ~ Spend, data = TVads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.422 -12.623  -8.171   8.832  50.526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 22.16269    7.08948   3.126  0.00556 **
## Spend        0.36317    0.09712   3.739  0.00139 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 19 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.3936 
## F-statistic: 13.98 on 1 and 19 DF,  p-value: 0.001389

The output shows the estimated regression coefficients, their standard errors, and the test statistics and p-values for testing their significance. It also reports R-squared, the degrees of freedom, and the F statistic for testing the significance of the fitted model. You can use the abline function as shown below to plot the fitted regression line on the scatterplot.
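Beyond reading the printed summary, the same quantities can be pulled out of the summary object programmatically. A minimal sketch on simulated data (the variable names here are illustrative, not from the TVads data):

```r
# Simulate a small data set and fit a simple linear regression
set.seed(1)
df <- data.frame(x = 1:20)
df$y <- 5 + 2 * df$x + rnorm(20)

fit <- lm(y ~ x, data = df)
s <- summary(fit)

coef(s)       # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared   # Multiple R-squared
s$sigma       # residual standard error
s$fstatistic  # F statistic and its degrees of freedom
```

This is handy when you need a coefficient or R-squared value inside a script rather than on screen.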

plot(TVads$Spend, TVads$MilImp, pch=16, xlab = "TV advertising budget, 1983 ($ millions)", ylab = "Millions of retained impressions per week", main="Figure 2: Fitted regression model overlaid on the scatterplot of 
     Millions of retained impressions per week and TV advertising budget", cex.main=0.9) # Create the scatterplot
abline(model, col = "red", lwd= 3) # Draw the fitted line #

The 95% CI for parameter estimates can be obtained using the following function:

confint(model, level = 0.95)
##                2.5 %     97.5 %
## (Intercept) 7.324244 37.0011425
## Spend       0.159899  0.5664492

The 95% confidence interval for the slope of the regression line is (0.1599, 0.5664). To get the predicted value of the response for each value of the predictor, use the predict function as shown below.
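The interval from confint can be reproduced by hand with the textbook formula estimate ± t(α/2, n−2) × SE, using the estimate and standard error printed in the summary above:

```r
# Reproduce the 95% CI for the slope from the summary output
b1    <- 0.36317            # slope estimate from summary(model)
se1   <- 0.09712            # its standard error
tcrit <- qt(0.975, df = 19) # n = 21 observations, so n - 2 = 19 df

c(lower = b1 - tcrit * se1, upper = b1 + tcrit * se1)
# Matches confint(model): (0.1599, 0.5664)
```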

predict(model, TVads)
##        1        2        3        4        5        6        7        8        9       10       11       12       13       14       15       16       17       18       19       20       21 
## 40.35771 49.07389 29.17195 30.47938 52.08824 36.72597 89.67675 31.93208 29.57144 82.52222 31.96839 38.72343 78.41836 23.97856 40.21244 31.93208 24.23279 24.92282 25.50389 33.92953 24.37806

This same function can be used to get a prediction interval for each value of the predictor, as shown below.

predict(model, TVads, interval = "prediction", level = 0.95)
##         fit        lwr       upr
## 1  40.35771  -9.989125  90.70455
## 2  49.07389  -1.502881  99.65067
## 3  29.17195 -21.570203  79.91411
## 4  30.47938 -20.176808  81.13557
## 5  52.08824   1.322964 102.85351
## 6  36.72597 -13.664345  87.11629
## 7  89.67675  32.288077 147.06543
## 8  31.93208 -18.640841  82.50499
## 9  29.57144 -21.143338  80.28623
## 10 82.52222  26.944377 138.10007
## 11 31.96839 -18.602607  82.53939
## 12 38.72343 -11.632825  89.07969
## 13 78.41836  23.773745 133.06297
## 14 23.97856 -27.207071  75.16420
## 15 40.21244 -10.134558  90.55945
## 16 31.93208 -18.640841  82.50499
## 17 24.23279 -26.927385  75.39296
## 18 24.92282 -26.170173  76.01581
## 19 25.50389 -25.534718  76.54251
## 20 33.92953 -16.550051  84.40912
## 21 24.37806 -26.767737  75.52385

You may want to compute predicted values of the response for a given set of new values of the predictor. In that case, first create a new data frame with the new values of the predictor, then use the predict function as shown below. Here, our new values of the Spend variable are 50 and 100.

NewData <- data.frame(Spend = c(50, 100)) # Creates a new data frame with only new values of predictor variable
predict(model, NewData, interval = "prediction", level = 0.95) # Use the new data frame in the predict function instead of the original data set
##       fit        lwr       upr
## 1 40.3214 -10.025471  90.66826
## 2 58.4801   7.133667 109.82653
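As a sanity check, the fitted value at Spend = 50 can be reproduced directly from the printed coefficients:

```r
# Point prediction at Spend = 50 from the estimated coefficients
b0 <- 22.16269
b1 <- 0.36317
b0 + b1 * 50
# 40.32119, matching the "fit" column above (40.3214 up to rounding)
```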

The above output shows a predicted value of about 40.32 million retained impressions per week when the TV advertising budget is $50 million. To check the model diagnostics using residual plots, use the code below.

plot(model) # This plots the diagnostic graphs for the fitted regression model. Hit the Enter key 4 times to see the plots shown below #

When you want to do multiple linear regression in R, add the other predictors to the formula in the lm function, each separated by a plus sign.
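A minimal sketch of the formula syntax, using simulated data with hypothetical predictors x1 and x2 (the TVads data has only one numeric predictor):

```r
# Simulate a response depending on two predictors
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

multi_model <- lm(y ~ x1 + x2)  # predictors joined by '+'
coef(multi_model)               # intercept plus one slope per predictor
```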

The Solution with the tidyverse

# install.packages("tidyverse")

To use R packages you have installed, include this line in your script or in the Console:

library("tidyverse")
# Note here the quotation marks are optional and
# it is the same as
library(tidyverse)

Import Data

  1. Download the data file here.
  2. What format is the data file in?
  3. Import the data in the file into R as a data frame:
my_df <- read_table2("TVads.txt")
my_df
## # A tibble: 21 x 3
##    Firm            Spend MilImp
##    <chr>           <dbl>  <dbl>
##  1 MILLER.LITE      50.1   32.1
##  2 PEPSI            74.1   99.6
##  3 STROH'S          19.3   11.7
##  4 FEDERAL.EXPRESS  22.9   21.9
##  5 BURGER.KING      82.4   60.8
##  6 COCO-COLA        40.1   78.6
##  7 MC.DONALD'S     186.    92.4
##  8 MCI              26.9   50.7
##  9 DIET.COLA        20.4   21.4
## 10 FORD            166.    40.1
## # ... with 11 more rows

Of course, the steps for importing data into your statistical software vary a little by software and by data format.

More information on how to import data into R can be found at Import Data.

Descriptive Statistics

Visualization

  1. Visualize a single variable
ggplot(my_df, aes(x=MilImp)) + geom_histogram()

ggplot(my_df, aes(x=Spend)) + geom_histogram()

  1. Visualize a pair of numeric variables
ggplot(my_df, aes(x=Spend, y=MilImp)) + geom_point()

Correlation

cor(my_df$MilImp, my_df$Spend)
## [1] 0.6511151
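For simple linear regression, the Multiple R-squared reported by summary() is exactly the squared correlation between the two variables. Checking this with the value printed above:

```r
# R-squared equals the squared correlation in simple linear regression
r <- 0.6511151  # cor(my_df$MilImp, my_df$Spend) from above
r^2             # approx. 0.42395 -- the "Multiple R-squared" of 0.424
```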

Regression

In R, run linear regressions with lm (short for linear model):

lm(MilImp ~ Spend, data=my_df)
## 
## Call:
## lm(formula = MilImp ~ Spend, data = my_df)
## 
## Coefficients:
## (Intercept)        Spend  
##     22.1627       0.3632

More detailed regression results

  1. Pass the results from lm() to summary() for more detailed information:
lm(MilImp ~ Spend, data=my_df) %>%
  summary
## 
## Call:
## lm(formula = MilImp ~ Spend, data = my_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.422 -12.623  -8.171   8.832  50.526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 22.16269    7.08948   3.126  0.00556 **
## Spend        0.36317    0.09712   3.739  0.00139 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 19 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.3936 
## F-statistic: 13.98 on 1 and 19 DF,  p-value: 0.001389

For better formatting of the results (pretty print), we can use the texreg package:

## Install and load texreg package
# install.packages("texreg")
library(texreg)

# Pretty print regression results on screen
lm(MilImp ~ Spend, data=my_df) %>%
  screenreg
## 
## =====================
##              Model 1 
## ---------------------
## (Intercept)  22.16 **
##              (7.09)  
## Spend         0.36 **
##              (0.10)  
## ---------------------
## R^2           0.42   
## Adj. R^2      0.39   
## Num. obs.    21      
## RMSE         23.50   
## =====================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Save regression results to a html file
lm(MilImp ~ Spend, data=my_df) %>%
  htmlreg(file="lm_MilImp-Spend.html")

Visualize regression results

ggplot(my_df, aes(x=Spend, y=MilImp)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

Diagnostic plots of regression

We will use the ggfortify package to generate the diagnostic plots for the regression.

## Install and load ggfortify package
# install.packages("ggfortify")
library(ggfortify)

lm(MilImp ~ Spend, data=my_df) %>%
  autoplot()

# Since we need the regression results in many places, 
# it is easier to save them into an R variable
MilImp_lm <- lm(MilImp ~ Spend, data=my_df)

# Save the diagnostic plots as a png file
ggsave("MilImp_lm_diag.png")

Lab 1

A hospital is implementing a program to improve service quality and productivity. As part of this program, the hospital management is attempting to measure and evaluate patient satisfaction. Table B.17 contains some of the data that have been collected on a random sample of 25 recently discharged patients.

Variables:
Satisfaction – a subjective measure on an increasing scale
Age – age of the patient in years
Severity – an index measuring the severity of the patient’s illness
Surgical-Medical – an indicator of whether the patient is a surgical or medical patient (0 = surgical, 1 = medical)
Anxiety – an index measuring the patient’s anxiety level

Patient satisfaction is thought to be related to patient age.

https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals

# Load the package 
library(readxl)
library(tidyverse)
library(broom)
# Import Data
table_b17 <- read_xlsx("TableB.17.xlsx")

# Observe the data frame
head(table_b17)

(a) Create a scatterplot of satisfaction and age variables. Copy and paste it here.

# Visualizing Satisfaction and Age
table_b17 %>% ggplot(aes(Age, Satisfaction)) + geom_point() + geom_smooth(method = "lm", level = 0.95)
  1. Describe the relationship between patient satisfaction and patient age based on the scatterplot.

The plot shows a decreasing approximately linear pattern.

  1. Fit a simple linear regression model to the data. Write the fitted regression model with estimated coefficients. Edit the following math equation to write your fitted model (replace y and x with appropriate variable names).

The model is \(Satisfaction=130.221-1.249Age\)

  1. The coefficient of determination, denoted by R^2 (Multiple R-squared in R), gives the percentage of the total variation in the response variable explained by the fitted model. It can be any value from 0 to 100%, and a higher value is better. What is the percentage of the total variation in patient satisfaction explained by the fitted model for these data?
# Fit the simple linear regression model of Satisfaction on Age
model_satis_age <- lm(Satisfaction ~ Age, data = table_b17)
model_satis_age %>% summary()

According to the summary output, the Multiple R-squared is 0.7205. The fitted model explains 72.05% of the total variation in patient satisfaction.

  1. Report the standard error of estimated intercept.

The standard error of the estimated intercept is 7.7775.

  1. Using the p-value reported along with regression coefficients in your software output, explain whether the true slope of the fitted model is significant at 5% significance level. Report the p-value as well.

The output shows that the p-value is 1.52e-08, which is less than 0.05, so the slope is significant at the 5% significance level.

  1. Report a 95% confidence interval for the true intercept of the model. Based on confidence interval, explain whether the true intercept is different from zero.

The equation for the 95% CI for the estimated \(\beta_0\) is defined as:

\[\hat{\beta}_0 \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)}\]

model_satis_age %>% confint(level=0.95)

The fitted β0 is 130.2209. The 95% confidence interval for the intercept of the regression line is (114.1318, 146.3100).
Since this confidence interval does not contain 0, it can be concluded that the true intercept is different from zero.

  1. Construct the analysis of variance (ANOVA) table using software and test for significance of regression model. Report the p-value.
anova(model_satis_age)

The ANOVA table shows that the p-value is 1.52e-08, which is less than 0.05, so the regression model is significant at the 5% significance level.

  1. Compute the point estimate of the mean patient satisfaction when the patient is 65 years old.

The confidence interval around the mean response, denoted \(\mu_y\), when the predictor value is \(x_h\) is defined as:

\[\hat{y}_h \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum(x_i - \bar{x})^2} \right)}\]

Where \(\hat{y}_h\) is the fitted response for predictor value \(x_h\), \(t_{\alpha/2,n-2}\) is the t-value with n − 2 degrees of freedom, and \(\sqrt{MSE \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum(x_i - \bar{x})^2} \right)}\) is the standard error of the fit.

model_satis_age %>% predict(newdata = data.frame(Age=65), interval = "confidence", level=0.95)

From the output, the fitted (mean) satisfaction is 49.03367 when the patient is 65 years old.

  1. Compute 95% confidence interval for the mean (average) patient satisfaction when the patient is 65 years old. (Confidence interval is for mean or average response)

According to the result above, the confidence interval (42.86397, 55.20336) gives the range in which the mean patient satisfaction at age 65 lies at a 95% level of confidence.

  1. Compute the 95% prediction interval for the patient satisfaction when the patient is 65 years old. (Prediction interval is for a new value of response and not for mean response)
model_satis_age %>% predict(newdata = data.frame(Age=65), interval = "prediction", level=0.95)

The 95% prediction interval for patient satisfaction when the patient is 65 years old is (26.11012, 71.95721).

  1. Compare the two intervals in the previous two questions and explain why they are same or different.

The results show that the prediction interval is wider than the confidence interval because of the additional error term in the standard error of prediction. The two intervals are similar in that both concern the response at a given predictor value, but they differ in what they represent: the confidence interval is for the mean response, while the prediction interval is for a single new observation. The prediction interval therefore reflects both the error in the fitted model and the error associated with a future observation.
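Comparing the widths of the two intervals reported above makes the difference concrete:

```r
# Interval bounds copied from the predict() outputs at Age = 65
conf_int <- c(lwr = 42.86397, upr = 55.20336)  # 95% CI for the mean response
pred_int <- c(lwr = 26.11012, upr = 71.95721)  # 95% PI for a new observation

diff(conf_int)  # width of the confidence interval: 12.33939
diff(pred_int)  # width of the prediction interval: 45.84709
```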

  1. Use the estimated slope and its standard error from software output to test whether the true slope is different from -1 (negative 1). Provide all the steps of test.
t0 <- (-1.2490 - (-1)) / 0.1471 # t statistic: (estimate - hypothesized slope) / standard error
2*pt(q = t0, df = nrow(table_b17) - 2, lower.tail = TRUE) # Two-sided p-value: 2 * P(T < t0), since t0 < 0. Use FALSE instead of TRUE to compute P(T > t0). #

The result shows that the p-value is 0.1040, which is greater than 0.05. Therefore, we fail to reject H0, and the true slope may be equal to -1.
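An equivalent check builds a 95% confidence interval for the slope from the same reported estimate and standard error; if the interval covers -1, the test at the 5% level fails to reject:

```r
# 95% CI for the slope from the reported estimate and SE (n = 25 patients)
b1    <- -1.2490
se1   <-  0.1471
tcrit <- qt(0.975, df = 25 - 2)
ci <- c(b1 - tcrit * se1, b1 + tcrit * se1)
ci
# (-1.5533, -0.9447): the interval contains -1, consistent with failing to reject H0
```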

Lab 2

Lab 3

Class Nov 13th

Lab 4

# Load the package 
library(readxl)
library(tidyverse)
library(broom)
library(olsrr)
library(car)
# Import Data
table_b4 <- read_table2("TableB4.txt")
# Observe the data frame
head(table_b4)
## # A tibble: 6 x 10
##       y    x1    x2    x3    x4    x5    x6    x7    x8    x9
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1  25.9  4.92     1  3.47 0.998     1     7     4    42     0
## 2  29.5  5.02     1  3.53 1.5       2     7     4    62     0
## 3  27.9  4.54     1  2.28 1.18      1     6     3    40     0
## 4  25.9  4.56     1  4.05 1.23      1     6     3    54     0
## 5  29.9  5.06     1  4.46 1.12      1     6     3    42     0
## 6  29.9  3.89     1  4.46 0.988     1     6     3    56     0
summary(table_b4)
##        y               x1              x2              x3              x4              x5              x6            x7              x8              x9      
##  Min.   :25.90   Min.   :3.891   Min.   :1.000   Min.   :2.275   Min.   :0.975   Min.   :0.000   Min.   :5.0   Min.   :2.000   Min.   : 3.00   Min.   :0.00  
##  1st Qu.:29.90   1st Qu.:5.057   1st Qu.:1.000   1st Qu.:4.855   1st Qu.:1.161   1st Qu.:1.000   1st Qu.:6.0   1st Qu.:3.000   1st Qu.:30.00   1st Qu.:0.00  
##  Median :33.70   Median :5.974   Median :1.000   Median :5.685   Median :1.432   Median :1.000   Median :6.0   Median :3.000   Median :40.00   Median :0.00  
##  Mean   :34.61   Mean   :6.405   Mean   :1.167   Mean   :6.033   Mean   :1.384   Mean   :1.312   Mean   :6.5   Mean   :3.167   Mean   :37.46   Mean   :0.25  
##  3rd Qu.:38.15   3rd Qu.:7.873   3rd Qu.:1.500   3rd Qu.:7.158   3rd Qu.:1.577   3rd Qu.:2.000   3rd Qu.:7.0   3rd Qu.:3.250   3rd Qu.:48.50   3rd Qu.:0.25  
##  Max.   :45.80   Max.   :9.142   Max.   :1.500   Max.   :9.890   Max.   :1.831   Max.   :2.000   Max.   :8.0   Max.   :4.000   Max.   :62.00   Max.   :1.00
# Scatterplot matrix of all variables in the data
plot(table_b4, pch=16)

# matrix of correlation coefficients
cor(table_b4)
##             y         x1         x2         x3         x4          x5        x6        x7          x8        x9
## y   1.0000000  0.8759762  0.7128360  0.6493554  0.7101541  0.45885885 0.5308653 0.2827929 -0.39851772 0.2623629
## x1  0.8759762  1.0000000  0.6512433  0.6892118  0.7342610  0.45857369 0.6405621 0.3670500 -0.43711531 0.1466709
## x2  0.7128360  0.6512433  1.0000000  0.4129438  0.7285916  0.22402204 0.5103104 0.4264014 -0.10074847 0.2041241
## x3  0.6493554  0.6892118  0.4129438  1.0000000  0.5715551  0.20466834 0.3921362 0.1516178 -0.35276370 0.3059960
## x4  0.7101541  0.7342610  0.7285916  0.5715551  1.0000000  0.35888351 0.6788606 0.5743353 -0.13908686 0.1065612
## x5  0.4588588  0.4585737  0.2240220  0.2046683  0.3588835  1.00000000 0.5893871 0.5412988 -0.02016883 0.1016185
## x6  0.5308653  0.6405621  0.5103104  0.3921362  0.6788606  0.58938707 1.0000000 0.8703883  0.12426629 0.2222222
## x7  0.2827929  0.3670500  0.4264014  0.1516178  0.5743353  0.54129880 0.8703883 1.0000000  0.31351144 0.0000000
## x8 -0.3985177 -0.4371153 -0.1007485 -0.3527637 -0.1390869 -0.02016883 0.1242663 0.3135114  1.00000000 0.2257796
## x9  0.2623629  0.1466709  0.2041241  0.3059960  0.1065612  0.10161846 0.2222222 0.0000000  0.22577960 1.0000000
table_b4 %>% corrr::correlate() %>% corrr::fashion()
##    rowname    y   x1   x2   x3   x4   x5   x6   x7   x8   x9
## 1        y       .88  .71  .65  .71  .46  .53  .28 -.40  .26
## 2       x1  .88       .65  .69  .73  .46  .64  .37 -.44  .15
## 3       x2  .71  .65       .41  .73  .22  .51  .43 -.10  .20
## 4       x3  .65  .69  .41       .57  .20  .39  .15 -.35  .31
## 5       x4  .71  .73  .73  .57       .36  .68  .57 -.14  .11
## 6       x5  .46  .46  .22  .20  .36       .59  .54 -.02  .10
## 7       x6  .53  .64  .51  .39  .68  .59       .87  .12  .22
## 8       x7  .28  .37  .43  .15  .57  .54  .87       .31  .00
## 9       x8 -.40 -.44 -.10 -.35 -.14 -.02  .12  .31       .23
## 10      x9  .26  .15  .20  .31  .11  .10  .22  .00  .23
table_b4 %>% corrr::correlate() %>% corrr::rplot()

# build the model
model_b4_full <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data=table_b4)
model_b4_full%>% summary()
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, 
##     data = table_b4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.720 -1.956 -0.045  1.627  4.253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 14.92765    5.91285   2.525   0.0243 *
## x1           1.92472    1.02990   1.869   0.0827 .
## x2           7.00053    4.30037   1.628   0.1258  
## x3           0.14918    0.49039   0.304   0.7654  
## x4           2.72281    4.35955   0.625   0.5423  
## x5           2.00668    1.37351   1.461   0.1661  
## x6          -0.41012    2.37854  -0.172   0.8656  
## x7          -1.40324    3.39554  -0.413   0.6857  
## x8          -0.03715    0.06672  -0.557   0.5865  
## x9           1.55945    1.93750   0.805   0.4343  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.949 on 14 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.7587 
## F-statistic: 9.037 on 9 and 14 DF,  p-value: 0.000185
anova(model_b4_full)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 636.16  636.16 73.1525 6.238e-07 ***
## x2         1  29.18   29.18  3.3551   0.08836 .  
## x3         1   4.71    4.71  0.5416   0.47391    
## x4         1   0.03    0.03  0.0032   0.95537    
## x5         1   8.78    8.78  1.0091   0.33216    
## x6         1  13.03   13.03  1.4982   0.24115    
## x7         1   9.14    9.14  1.0515   0.32254    
## x8         1   0.64    0.64  0.0741   0.78943    
## x9         1   5.63    5.63  0.6478   0.43435    
## Residuals 14 121.75    8.70                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vcov(model_b4_full)
##             (Intercept)         x1          x2           x3           x4           x5           x6          x7           x8          x9
## (Intercept)  34.9618142  1.4185704 -8.05381699 -0.578910413  1.544918516  0.590245661 -9.210285660  9.29003130 -0.122000743  5.23739149
## x1            1.4185704  1.0606966 -1.95694395 -0.188439886 -0.796712965 -0.449355607 -1.401396213  1.56627449  0.024816000  0.47756201
## x2           -8.0538170 -1.9569440 18.49319704  0.498255845 -7.835636978  1.385215724  2.985701832 -4.19172566 -0.011989093 -2.61735808
## x3           -0.5789104 -0.1884399  0.49825584  0.240479765 -0.613877784  0.105719049  0.097666359 -0.08053358  0.004479017 -0.33268222
## x4            1.5449185 -0.7967130 -7.83563698 -0.613877784 19.005706515  0.473391133 -0.488584336 -2.34434350 -0.004688778  1.04295974
## x5            0.5902457 -0.4493556  1.38521572  0.105719049  0.473391133  1.886525901  0.147174179 -1.33029505  0.007405142 -0.47388005
## x6           -9.2102857 -1.4013962  2.98570183  0.097666359 -0.488584336  0.147174179  5.657473253 -6.79090191 -0.003215709 -2.20865227
## x7            9.2900313  1.5662745 -4.19172566 -0.080533584 -2.344343503 -1.330295048 -6.790901910 11.52970487 -0.058993674  3.53532750
## x8           -0.1220007  0.0248160 -0.01198909  0.004479017 -0.004688778  0.007405142 -0.003215709 -0.05899367  0.004451546 -0.04896325
## x9            5.2373915  0.4775620 -2.61735808 -0.332682217  1.042959743 -0.473880052 -2.208652269  3.53532750 -0.048963249  3.75389020
vif(model_b4_full)
##        x1        x2        x3        x4        x5        x6        x7        x8        x9 
##  7.021036  2.835413  2.454907  3.836477  1.823605 11.710101  9.722663  2.320887  1.942494
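The variance inflation factor for predictor j is VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing x_j on the other predictors. A self-contained sketch with simulated collinear data (not the Table B.4 variables):

```r
# Two deliberately correlated predictors
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # x2 strongly correlated with x1

# VIF of x1: regress it on the remaining predictor(s)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # large because x1 is nearly a linear function of x2
```

A common rule of thumb flags VIF values above 5 or 10 as signs of problematic multicollinearity, which is why x6 and x7 stand out in the output above.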
ols_vif_tol(model_b4_full)
## # A tibble: 9 x 3
##   Variables Tolerance   VIF
##   <chr>         <dbl> <dbl>
## 1 x1           0.142   7.02
## 2 x2           0.353   2.84
## 3 x3           0.407   2.45
## 4 x4           0.261   3.84
## 5 x5           0.548   1.82
## 6 x6           0.0854 11.7 
## 7 x7           0.103   9.72
## 8 x8           0.431   2.32
## 9 x9           0.515   1.94
plot(model_b4_full,pch=16,col="blue")

avPlots(model_b4_full)

ols_plot_added_variable(model_b4_full)

yhat_b4_full <- model_b4_full$fit # Save predicted y to an object #
t_b4_full <- rstudent(model_b4_full) # Save studentized residuals to an object #
# Another way to check non-normality of studentized residuals #
hist(t_b4_full)

ols_plot_resid_hist(model_b4_full)

qqnorm(t_b4_full, pch=15)

ols_plot_resid_qq(model_b4_full)

shapiro.test(t_b4_full) # If the p-value is large (> 0.05), there is no evidence of non-normality #
## 
##  Shapiro-Wilk normality test
## 
## data:  t_b4_full
## W = 0.96485, p-value = 0.5432
plot(yhat_b4_full,t_b4_full, pch=16) # Studentized residuals versus predicted y #

plot(table_b4$x1,t_b4_full, pch=16) # Studentized residuals versus x1 predictor #

plot(table_b4$x2,t_b4_full, pch=16) # Studentized residuals versus x2 predictor #

plot(table_b4$x3,t_b4_full, pch=16) # Studentized residuals versus x3 predictor #

plot(table_b4$x4,t_b4_full, pch=16) # Studentized residuals versus x4 predictor #

plot(table_b4$x5,t_b4_full, pch=16) # Studentized residuals versus x5 predictor #

plot(table_b4$x6,t_b4_full, pch=16) # Studentized residuals versus x6 predictor #

plot(table_b4$x7,t_b4_full, pch=16) # Studentized residuals versus x7 predictor #

plot(table_b4$x8,t_b4_full, pch=16) # Studentized residuals versus x8 predictor #

plot(table_b4$x9,t_b4_full, pch=16) # Studentized residuals versus x9 predictor #

table_b4 %>% mutate(student_residual=t_b4_full) %>% plot()

ols_plot_diagnostics(model_b4_full)

ols_plot_dfbetas(model_b4_full)

# Lack of Fit F Test
ols_pure_error_anova(lm(y~x1, data = table_b4))
## Lack of Fit F Test 
## ---------------
## Response :   y 
## Predictor:   x1 
## 
##                   Analysis of Variance Table                    
## ---------------------------------------------------------------
##                 DF     Sum Sq     Mean Sq     F Value    Pr(>F) 
## ---------------------------------------------------------------
## x1               1    636.1557    636.1557        NaN       NaN 
## Residual        22    192.8906    8.767753                      
##  Lack of fit    22    192.8906    8.767753        NaN       NaN 
##  Pure Error      0        0.00         NaN                      
## ---------------------------------------------------------------
ols_plot_resid_stud(model_b4_full)

MPV::PRESS(model_b4_full)
## [1] 393.492
ols_press(model_b4_full)
## [1] 393.492
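The PRESS statistic is the sum of squared leave-one-out prediction errors; for least squares it can be computed without refitting, via sum((e_i / (1 − h_ii))²) using ordinary residuals and leverages. A self-contained demo on simulated data verifying the identity against a brute-force leave-one-out loop:

```r
# Simulate a small regression data set
set.seed(11)
df <- data.frame(x = rnorm(30))
df$y <- 1 + 2 * df$x + rnorm(30)
fit <- lm(y ~ x, data = df)

# Closed-form PRESS from residuals and hat values
press_formula <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute-force leave-one-out check
press_loo <- sum(sapply(seq_len(nrow(df)), function(i) {
  fit_i <- lm(y ~ x, data = df[-i, ])
  (df$y[i] - predict(fit_i, newdata = df[i, , drop = FALSE]))^2
}))

all.equal(press_formula, press_loo)  # TRUE
```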
library(MASS)
bc <- boxcox(model_b4_full)

# Extract the lambda that maximizes the log-likelihood #
lambda <- bc$x[which.max(bc$y)]

# another package forecast for boxcox
# library(forecast)
# lambda <- BoxCox.lambda(model_b4_full)

# BoxCox(model_b4_full, lambda="auto")
# InvBoxCox(model_b4_full, lambda="auto", biasadj = FALSE, fvar = NULL)

# Log-likelihood of the fitted model
logLik(model_b4_full)
## 'log Lik.' -53.54133 (df=11)
model_b4_none <- update(model_b4_full, .~1)
logLik(model_b4_none)
## 'log Lik.' -76.56119 (df=2)
# pseudo R2
1 - logLik(model_b4_full)/logLik(model_b4_none)
## 'log Lik.' 0.3006726 (df=11)
# Interpretation of coefficients
# Odds ratio (note: exponentiating coefficients in this way is meaningful for logistic regression models)
odds <- exp(coef(model_b4_full))
# Implied probability: odds / (1 + odds)
odds/(1 + odds)
## (Intercept)          x1          x2          x3          x4          x5          x6          x7          x8          x9 
##   0.9999997   0.8726640   0.9990894   0.5372255   0.9383591   0.8814971   0.3988824   0.1973032   0.4907138   0.8262739
# Forward Selection Regression #
model_b4_none <- lm (y ~ 1, data = table_b4)
step(model_b4_none, direction= "forward", scope=(~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9), k=2)
## Start:  AIC=87.01
## y ~ 1
## 
##        Df Sum of Sq    RSS    AIC
## + x1    1    636.16 192.89 54.018
## + x2    1    421.27 407.78 71.984
## + x4    1    418.10 410.94 72.170
## + x3    1    349.58 479.47 75.871
## + x6    1    233.64 595.41 81.069
## + x5    1    174.56 654.49 83.339
## + x8    1    131.67 697.38 84.863
## + x7    1     66.30 762.75 87.013
## <none>              829.05 87.013
## + x9    1     57.07 771.98 87.302
## 
## Step:  AIC=54.02
## y ~ x1
## 
##        Df Sum of Sq    RSS    AIC
## + x2    1   29.1766 163.71 52.082
## <none>              192.89 54.018
## + x9    1   15.1870 177.70 54.050
## + x4    1    8.0654 184.82 54.993
## + x5    1    3.4299 189.46 55.587
## + x3    1    3.2869 189.60 55.605
## + x7    1    1.4375 191.45 55.838
## + x6    1    1.2867 191.60 55.857
## + x8    1    0.2499 192.64 55.987
## 
## Step:  AIC=52.08
## y ~ x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## <none>              163.71 52.082
## + x9    1    9.9142 153.80 52.582
## + x7    1    7.4562 156.26 52.963
## + x5    1    6.0754 157.64 53.174
## + x3    1    4.7101 159.00 53.381
## + x8    1    4.1231 159.59 53.469
## + x6    1    4.0956 159.62 53.474
## + x4    1    0.0602 163.65 54.073
## 
## Call:
## lm(formula = y ~ x1 + x2, data = table_b4)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      10.042        2.713        6.164
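step() also supports backward elimination and stepwise search via its direction argument. A minimal sketch on simulated data with one pure-noise predictor x3 (variable names here are illustrative, not from Table B.4):

```r
# Simulate a response that depends on x1 and x2 but not x3
set.seed(3)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)
best <- step(full, direction = "both", trace = 0)  # AIC-based stepwise search
formula(best)
```

With direction = "backward" the search only drops terms; "both" allows dropped terms to re-enter at later steps.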
ols_step_forward_p(model_b4_full, details = TRUE)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. x1 
## 2. x2 
## 3. x3 
## 4. x4 
## 5. x5 
## 6. x6 
## 7. x7 
## 8. x8 
## 9. x9 
## 
## We are selecting variables based on p value...
## 
## 
## Forward Selection: Step 1 
## 
## - x1 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.876       RMSE               2.961 
## R-Squared               0.767       Coef. Var          8.555 
## Adj. R-Squared          0.757       MSE                8.768 
## Pred R-Squared          0.729       MAE                2.352 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    636.156         1        636.156    72.556    0.0000 
## Residual      192.891        22          8.768                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                  Parameter Estimates                                   
## --------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t       Sig     lower     upper 
## --------------------------------------------------------------------------------------
## (Intercept)    13.320         2.572                 5.179    0.000    7.987    18.654 
##          x1     3.324         0.390        0.876    8.518    0.000    2.515     4.134 
## --------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 2 
## 
## - x2 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.896       RMSE               2.792 
## R-Squared               0.803       Coef. Var          8.067 
## Adj. R-Squared          0.784       MSE                7.796 
## Pred R-Squared          0.729       MAE                2.160 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    665.332         2        332.666    42.672    0.0000 
## Residual      163.714        21          7.796                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                   
## ---------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t       Sig      lower     upper 
## ---------------------------------------------------------------------------------------
## (Intercept)    10.042         2.958                 3.394    0.003     3.889    16.194 
##          x1     2.713         0.485        0.715    5.595    0.000     1.705     3.722 
##          x2     6.164         3.186        0.247    1.935    0.067    -0.462    12.791 
## ---------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 3 
## 
## - x9 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.902       RMSE               2.773 
## R-Squared               0.814       Coef. Var          8.012 
## Adj. R-Squared          0.787       MSE                7.690 
## Pred R-Squared          0.707       MAE                2.150 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    675.247         3        225.082     29.27    0.0000 
## Residual      153.800        20          7.690                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                   
## ---------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t       Sig      lower     upper 
## ---------------------------------------------------------------------------------------
## (Intercept)    10.340         2.950                 3.505    0.002     4.187    16.494 
##          x1     2.703         0.482        0.712    5.612    0.000     1.698     3.708 
##          x2     5.639         3.198        0.226    1.763    0.093    -1.033    12.310 
##          x9     1.516         1.336        0.112    1.135    0.270    -1.269     4.302 
## ---------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 4 
## 
## - x8 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.909       RMSE               2.753 
## R-Squared               0.826       Coef. Var          7.955 
## Adj. R-Squared          0.790       MSE                7.581 
## Pred R-Squared          0.693       MAE                2.173 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    685.002         4        171.250    22.589    0.0000 
## Residual      144.045        19          7.581                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.377         3.968                  3.371    0.003     5.072    21.682 
##          x1     2.379         0.557        0.627     4.267    0.000     1.212     3.545 
##          x2     6.520         3.269        0.261     1.994    0.061    -0.322    13.363 
##          x9     1.991         1.391        0.147     1.432    0.168    -0.919     4.902 
##          x8    -0.056         0.050       -0.131    -1.134    0.271    -0.160     0.047 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 5 
## 
## - x5 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.916       RMSE               2.725 
## R-Squared               0.839       Coef. Var          7.872 
## Adj. R-Squared          0.794       MSE                7.425 
## Pred R-Squared          0.671       MAE                1.998 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    695.403         5        139.081    18.732    0.0000 
## Residual      133.643        18          7.425                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.737         3.939                  3.488    0.003     5.463    22.012 
##          x1     2.017         0.631        0.531     3.198    0.005     0.692     3.342 
##          x2     7.224         3.290        0.290     2.196    0.041     0.313    14.136 
##          x9     2.034         1.377        0.150     1.478    0.157    -0.858     4.927 
##          x8    -0.072         0.051       -0.168    -1.417    0.174    -0.179     0.035 
##          x5     1.307         1.105        0.132     1.184    0.252    -1.013     3.628 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## No more variables to be added.
## 
## Variables Entered: 
## 
## + x1 
## + x2 
## + x9 
## + x8 
## + x5 
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.916       RMSE               2.725 
## R-Squared               0.839       Coef. Var          7.872 
## Adj. R-Squared          0.794       MSE                7.425 
## Pred R-Squared          0.671       MAE                1.998 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    695.403         5        139.081    18.732    0.0000 
## Residual      133.643        18          7.425                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.737         3.939                  3.488    0.003     5.463    22.012 
##          x1     2.017         0.631        0.531     3.198    0.005     0.692     3.342 
##          x2     7.224         3.290        0.290     2.196    0.041     0.313    14.136 
##          x9     2.034         1.377        0.150     1.478    0.157    -0.858     4.927 
##          x8    -0.072         0.051       -0.168    -1.417    0.174    -0.179     0.035 
##          x5     1.307         1.105        0.132     1.184    0.252    -1.013     3.628 
## ----------------------------------------------------------------------------------------
## 
##                            Selection Summary                             
## ------------------------------------------------------------------------
##         Variable                  Adj.                                      
## Step    Entered     R-Square    R-Square     C(p)       AIC        RMSE     
## ------------------------------------------------------------------------
##    1    x1            0.7673      0.7568    2.1808    124.1267    2.9610    
##    2    x2            0.8025      0.7837    0.8257    122.1907    2.7921    
##    3    x9            0.8145      0.7867    1.6857    122.6914    2.7731    
##    4    x8            0.8263      0.7897    2.5639    123.1187    2.7534    
##    5    x5            0.8388      0.7940    3.3678    123.3200    2.7248    
## ------------------------------------------------------------------------
ols_step_forward_aic(model_b4_full, details = TRUE)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1 . x1 
## 2 . x2 
## 3 . x3 
## 4 . x4 
## 5 . x5 
## 6 . x6 
## 7 . x7 
## 8 . x8 
## 9 . x9 
## 
##  Step 0: AIC = 157.1224 
##  y ~ 1 
## 
## ----------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq       RSS      R-Sq     Adj. R-Sq 
## ----------------------------------------------------------------------
## x1            1    124.127    636.156    192.891    0.767        0.757 
## x2            1    142.093    421.267    407.779    0.508        0.486 
## x4            1    142.279    418.104    410.943    0.504        0.482 
## x3            1    145.980    349.578    479.469    0.422        0.395 
## x6            1    151.178    233.640    595.406    0.282        0.249 
## x5            1    153.448    174.557    654.489    0.211        0.175 
## x8            1    154.972    131.666    697.380    0.159        0.121 
## x7            1    157.122     66.300    762.746    0.080        0.038 
## x9            1    157.411     57.067    771.979    0.069        0.027 
## ----------------------------------------------------------------------
## 
## 
## - x1 
## 
## 
##  Step 1 : AIC = 124.1267 
##  y ~ x1 
## 
## ---------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq      RSS      R-Sq     Adj. R-Sq 
## ---------------------------------------------------------------------
## x2            1    122.191    29.177    163.714    0.803        0.784 
## x9            1    124.159    15.187    177.704    0.786        0.765 
## x4            1    125.102     8.065    184.825    0.777        0.756 
## x5            1    125.696     3.430    189.461    0.771        0.750 
## x3            1    125.714     3.287    189.604    0.771        0.750 
## x7            1    125.947     1.438    191.453    0.769        0.747 
## x6            1    125.966     1.287    191.604    0.769        0.747 
## x8            1    126.096     0.250    192.641    0.768        0.746 
## ---------------------------------------------------------------------
## 
## - x2 
## 
## 
##  Step 2 : AIC = 122.1907 
##  y ~ x1 + x2 
## 
## ---------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq      RSS      R-Sq     Adj. R-Sq 
## ---------------------------------------------------------------------
## x9            1    122.691     9.914    153.800    0.814        0.787 
## x7            1    123.072     7.456    156.258    0.812        0.783 
## x5            1    123.283     6.075    157.639    0.810        0.781 
## x3            1    123.490     4.710    159.004    0.808        0.779 
## x8            1    123.578     4.123    159.591    0.808        0.779 
## x6            1    123.583     4.096    159.618    0.807        0.779 
## x4            1    124.182     0.060    163.654    0.803        0.773 
## ---------------------------------------------------------------------
## 
## 
## No more variables to be added.
## 
## Variables Entered: 
## 
## - x1 
## - x2 
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.896       RMSE               2.792 
## R-Squared               0.803       Coef. Var          8.067 
## Adj. R-Squared          0.784       MSE                7.796 
## Pred R-Squared          0.729       MAE                2.160 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    665.332         2        332.666    42.672    0.0000 
## Residual      163.714        21          7.796                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                   
## ---------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t       Sig      lower     upper 
## ---------------------------------------------------------------------------------------
## (Intercept)    10.042         2.958                 3.394    0.003     3.889    16.194 
##          x1     2.713         0.485        0.715    5.595    0.000     1.705     3.722 
##          x2     6.164         3.186        0.247    1.935    0.067    -0.462    12.791 
## ---------------------------------------------------------------------------------------
## 
##                         Selection Summary                          
## ------------------------------------------------------------------
## Variable       AIC      Sum Sq       RSS       R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------
## x1           124.127    636.156    192.891    0.76733      0.75676 
## x2           122.191    665.332    163.714    0.80253      0.78372 
## ------------------------------------------------------------------
# k = 2 (the default) gives AIC; k = log(n), where n is the number of observations, gives BIC #
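The comment above can be demonstrated directly: passing `k = log(n)` to `step()` replaces the AIC penalty with the BIC one, which penalizes extra terms more heavily for n > 7. A minimal sketch, assuming `model_b4_full` and `table_b4` fitted earlier in this lab are still in the workspace:

```r
# BIC-based elimination: the per-parameter penalty k defaults to 2 (AIC);
# k = log(n) gives BIC. Here n = 24 observations (total df 23 in the ANOVA).
n <- nrow(table_b4)
step(model_b4_full, direction = "backward", k = log(n))
```

Because BIC penalizes model size more strongly, it will typically stop at a smaller model than the AIC-based run shown below.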
# Backward Elimination Regression #
step(model_b4_full, direction = "backward")
## Start:  AIC=58.97
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
## 
##        Df Sum of Sq    RSS    AIC
## - x6    1    0.2585 122.01 57.025
## - x3    1    0.8048 122.55 57.132
## - x7    1    1.4852 123.23 57.265
## - x8    1    2.6960 124.44 57.499
## - x4    1    3.3922 125.14 57.633
## - x9    1    5.6337 127.38 58.059
## <none>              121.75 58.974
## - x5    1   18.5622 140.31 60.379
## - x2    1   23.0454 144.79 61.134
## - x1    1   30.3724 152.12 62.319
## 
## Step:  AIC=57.02
## y ~ x1 + x2 + x3 + x4 + x5 + x7 + x8 + x9
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1     0.889 122.90 55.199
## - x8    1     2.731 124.74 55.556
## - x4    1     3.312 125.32 55.667
## - x9    1     5.889 127.90 56.156
## - x7    1     9.249 131.26 56.778
## <none>              122.01 57.025
## - x5    1    18.798 140.81 58.464
## - x2    1    26.774 148.78 59.786
## - x1    1    40.508 162.51 61.905
## 
## Step:  AIC=55.2
## y ~ x1 + x2 + x4 + x5 + x7 + x8 + x9
## 
##        Df Sum of Sq    RSS    AIC
## - x8    1     3.245 126.14 53.824
## - x4    1     4.744 127.64 54.108
## - x9    1     8.718 131.61 54.844
## - x7    1     9.501 132.40 54.986
## <none>              122.90 55.199
## - x5    1    17.987 140.88 56.477
## - x2    1    25.930 148.83 57.793
## - x1    1    53.969 176.87 61.936
## 
## Step:  AIC=53.82
## y ~ x1 + x2 + x4 + x5 + x7 + x9
## 
##        Df Sum of Sq    RSS    AIC
## - x4    1     4.935 131.07 52.745
## - x9    1     5.839 131.98 52.910
## <none>              126.14 53.824
## - x5    1    19.015 145.16 55.194
## - x7    1    22.338 148.48 55.737
## - x2    1    24.770 150.91 56.127
## - x1    1    95.846 221.99 65.390
## 
## Step:  AIC=52.75
## y ~ x1 + x2 + x5 + x7 + x9
## 
##        Df Sum of Sq    RSS    AIC
## - x9    1     5.540 136.62 51.739
## <none>              131.07 52.745
## - x5    1    16.852 147.93 53.648
## - x7    1    17.471 148.55 53.748
## - x2    1    39.893 170.97 57.122
## - x1    1   158.792 289.87 69.793
## 
## Step:  AIC=51.74
## y ~ x1 + x2 + x5 + x7
## 
##        Df Sum of Sq    RSS    AIC
## <none>              136.62 51.739
## - x5    1    19.643 156.26 52.963
## - x7    1    21.024 157.64 53.174
## - x2    1    47.648 184.26 56.919
## - x1    1   157.504 294.12 68.142
## 
## Call:
## lm(formula = y ~ x1 + x2 + x5 + x7, data = table_b4)
## 
## Coefficients:
## (Intercept)           x1           x2           x5           x7  
##      13.509        2.419        8.480        2.001       -2.182
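The olsrr package also offers a p-value-based backward counterpart to the forward routine shown earlier, `ols_step_backward_p()`. A sketch (output omitted), assuming olsrr is loaded and `model_b4_full` is the full model fitted above:

```r
# Backward elimination by p-value with olsrr: starting from the full model,
# the term with the largest p-value is dropped at each step until every
# remaining term meets the retention criterion.
library(olsrr)
ols_step_backward_p(model_b4_full, details = TRUE)
```

As with the forward runs, `details = TRUE` prints the model summary, ANOVA table, and parameter estimates at every step, which is useful for comparing against the `step()` trace above.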
## ------------------------------------------------------------------------
##    1    x1            0.7673      0.7568    2.1808    124.1267    2.9610    
##    2    x2            0.8025      0.7837    0.8257    122.1907    2.7921    
##    3    x9            0.8145      0.7867    1.6857    122.6914    2.7731    
##    4    x8            0.8263      0.7897    2.5639    123.1187    2.7534    
##    5    x5            0.8388      0.7940    3.3678    123.3200    2.7248    
## ------------------------------------------------------------------------
ols_step_forward_aic(model_b4_full, details = TRUE)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1 . x1 
## 2 . x2 
## 3 . x3 
## 4 . x4 
## 5 . x5 
## 6 . x6 
## 7 . x7 
## 8 . x8 
## 9 . x9 
## 
##  Step 0: AIC = 157.1224 
##  y ~ 1 
## 
## ----------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq       RSS      R-Sq     Adj. R-Sq 
## ----------------------------------------------------------------------
## x1            1    124.127    636.156    192.891    0.767        0.757 
## x2            1    142.093    421.267    407.779    0.508        0.486 
## x4            1    142.279    418.104    410.943    0.504        0.482 
## x3            1    145.980    349.578    479.469    0.422        0.395 
## x6            1    151.178    233.640    595.406    0.282        0.249 
## x5            1    153.448    174.557    654.489    0.211        0.175 
## x8            1    154.972    131.666    697.380    0.159        0.121 
## x7            1    157.122     66.300    762.746    0.080        0.038 
## x9            1    157.411     57.067    771.979    0.069        0.027 
## ----------------------------------------------------------------------
## 
## 
## - x1 
## 
## 
##  Step 1 : AIC = 124.1267 
##  y ~ x1 
## 
## ---------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq      RSS      R-Sq     Adj. R-Sq 
## ---------------------------------------------------------------------
## x2            1    122.191    29.177    163.714    0.803        0.784 
## x9            1    124.159    15.187    177.704    0.786        0.765 
## x4            1    125.102     8.065    184.825    0.777        0.756 
## x5            1    125.696     3.430    189.461    0.771        0.750 
## x3            1    125.714     3.287    189.604    0.771        0.750 
## x7            1    125.947     1.438    191.453    0.769        0.747 
## x6            1    125.966     1.287    191.604    0.769        0.747 
## x8            1    126.096     0.250    192.641    0.768        0.746 
## ---------------------------------------------------------------------
## 
## - x2 
## 
## 
##  Step 2 : AIC = 122.1907 
##  y ~ x1 + x2 
## 
## ---------------------------------------------------------------------
## Variable     DF      AIC      Sum Sq      RSS      R-Sq     Adj. R-Sq 
## ---------------------------------------------------------------------
## x9            1    122.691     9.914    153.800    0.814        0.787 
## x7            1    123.072     7.456    156.258    0.812        0.783 
## x5            1    123.283     6.075    157.639    0.810        0.781 
## x3            1    123.490     4.710    159.004    0.808        0.779 
## x8            1    123.578     4.123    159.591    0.808        0.779 
## x6            1    123.583     4.096    159.618    0.807        0.779 
## x4            1    124.182     0.060    163.654    0.803        0.773 
## ---------------------------------------------------------------------
## 
## 
## No more variables to be added.
## 
## Variables Entered: 
## 
## - x1 
## - x2 
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.896       RMSE               2.792 
## R-Squared               0.803       Coef. Var          8.067 
## Adj. R-Squared          0.784       MSE                7.796 
## Pred R-Squared          0.729       MAE                2.160 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression    665.332         2        332.666    42.672    0.0000 
## Residual      163.714        21          7.796                     
## Total         829.046        23                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                   
## ---------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t       Sig      lower     upper 
## ---------------------------------------------------------------------------------------
## (Intercept)    10.042         2.958                 3.394    0.003     3.889    16.194 
##          x1     2.713         0.485        0.715    5.595    0.000     1.705     3.722 
##          x2     6.164         3.186        0.247    1.935    0.067    -0.462    12.791 
## ---------------------------------------------------------------------------------------
## 
##                         Selection Summary                          
## ------------------------------------------------------------------
## Variable       AIC      Sum Sq       RSS       R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------
## x1           124.127    636.156    192.891    0.76733      0.75676 
## x2           122.191    665.332    163.714    0.80253      0.78372 
## ------------------------------------------------------------------
# Subset method #
library(leaps) # Load the package #
model_b4_subset <- regsubsets(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data=table_b4, nbest=10) # nbest is the number of models reported for each size #
summary(model_b4_subset) # Hard to read output from this #
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + 
##     x9, data = table_b4, nbest = 10)
## 9 Variables  (and intercept)
##    Forced in Forced out
## x1     FALSE      FALSE
## x2     FALSE      FALSE
## x3     FALSE      FALSE
## x4     FALSE      FALSE
## x5     FALSE      FALSE
## x6     FALSE      FALSE
## x7     FALSE      FALSE
## x8     FALSE      FALSE
## x9     FALSE      FALSE
## 10 subsets of each size up to 8
## Selection Algorithm: exhaustive
##           x1  x2  x3  x4  x5  x6  x7  x8  x9 
## 1  ( 1 )  "*" " " " " " " " " " " " " " " " "
## 1  ( 2 )  " " "*" " " " " " " " " " " " " " "
## 1  ( 3 )  " " " " " " "*" " " " " " " " " " "
## 1  ( 4 )  " " " " "*" " " " " " " " " " " " "
## 1  ( 5 )  " " " " " " " " " " "*" " " " " " "
## 1  ( 6 )  " " " " " " " " "*" " " " " " " " "
## 1  ( 7 )  " " " " " " " " " " " " " " "*" " "
## 1  ( 8 )  " " " " " " " " " " " " "*" " " " "
## 1  ( 9 )  " " " " " " " " " " " " " " " " "*"
## 2  ( 1 )  "*" "*" " " " " " " " " " " " " " "
## 2  ( 2 )  "*" " " " " " " " " " " " " " " "*"
## 2  ( 3 )  "*" " " " " "*" " " " " " " " " " "
## 2  ( 4 )  "*" " " " " " " "*" " " " " " " " "
## 2  ( 5 )  "*" " " "*" " " " " " " " " " " " "
## 2  ( 6 )  "*" " " " " " " " " " " "*" " " " "
## 2  ( 7 )  "*" " " " " " " " " "*" " " " " " "
## 2  ( 8 )  "*" " " " " " " " " " " " " "*" " "
## 2  ( 9 )  " " "*" "*" " " " " " " " " " " " "
## 2  ( 10 ) " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 )  "*" "*" " " " " " " " " " " " " "*"
## 3  ( 2 )  "*" "*" " " " " " " " " "*" " " " "
## 3  ( 3 )  "*" "*" " " " " "*" " " " " " " " "
## 3  ( 4 )  "*" "*" "*" " " " " " " " " " " " "
## 3  ( 5 )  "*" "*" " " " " " " " " " " "*" " "
## 3  ( 6 )  "*" "*" " " " " " " "*" " " " " " "
## 3  ( 7 )  "*" "*" " " "*" " " " " " " " " " "
## 3  ( 8 )  "*" " " " " "*" " " " " " " " " "*"
## 3  ( 9 )  "*" " " " " " " " " " " " " "*" "*"
## 3  ( 10 ) "*" " " " " " " " " "*" " " " " "*"
## 4  ( 1 )  "*" "*" " " " " "*" " " "*" " " " "
## 4  ( 2 )  "*" "*" " " " " " " " " " " "*" "*"
## 4  ( 3 )  "*" "*" " " " " "*" "*" " " " " " "
## 4  ( 4 )  "*" "*" " " " " " " "*" " " " " "*"
## 4  ( 5 )  "*" "*" " " " " " " " " "*" " " "*"
## 4  ( 6 )  "*" "*" " " " " "*" " " " " " " "*"
## 4  ( 7 )  "*" "*" " " " " "*" " " " " "*" " "
## 4  ( 8 )  "*" "*" "*" " " "*" " " " " " " " "
## 4  ( 9 )  "*" "*" "*" " " " " " " " " " " "*"
## 4  ( 10 ) "*" "*" "*" " " " " " " "*" " " " "
## 5  ( 1 )  "*" "*" "*" " " "*" " " "*" " " " "
## 5  ( 2 )  "*" "*" " " " " "*" " " "*" " " "*"
## 5  ( 3 )  "*" "*" " " "*" "*" " " "*" " " " "
## 5  ( 4 )  "*" "*" " " " " "*" "*" " " " " "*"
## 5  ( 5 )  "*" "*" " " " " "*" " " " " "*" "*"
## 5  ( 6 )  "*" "*" " " " " "*" "*" "*" " " " "
## 5  ( 7 )  "*" "*" " " " " "*" " " "*" "*" " "
## 5  ( 8 )  "*" "*" "*" " " "*" "*" " " " " " "
## 5  ( 9 )  "*" "*" " " "*" " " " " " " "*" "*"
## 5  ( 10 ) "*" "*" "*" " " "*" " " " " "*" " "
## 6  ( 1 )  "*" "*" " " "*" "*" " " "*" " " "*"
## 6  ( 2 )  "*" "*" " " " " "*" " " "*" "*" "*"
## 6  ( 3 )  "*" "*" "*" " " "*" " " "*" " " "*"
## 6  ( 4 )  "*" "*" " " " " "*" "*" " " "*" "*"
## 6  ( 5 )  "*" "*" "*" "*" "*" " " "*" " " " "
## 6  ( 6 )  "*" "*" " " "*" "*" "*" " " " " "*"
## 6  ( 7 )  "*" "*" "*" " " "*" "*" " " " " "*"
## 6  ( 8 )  "*" "*" "*" " " "*" " " "*" "*" " "
## 6  ( 9 )  "*" "*" "*" " " "*" "*" "*" " " " "
## 6  ( 10 ) "*" "*" " " " " "*" "*" "*" " " "*"
## 7  ( 1 )  "*" "*" " " "*" "*" " " "*" "*" "*"
## 7  ( 2 )  "*" "*" " " "*" "*" "*" " " "*" "*"
## 7  ( 3 )  "*" "*" "*" "*" "*" " " "*" " " "*"
## 7  ( 4 )  "*" "*" "*" " " "*" " " "*" "*" "*"
## 7  ( 5 )  "*" "*" " " "*" "*" "*" "*" " " "*"
## 7  ( 6 )  "*" "*" "*" " " "*" "*" " " "*" "*"
## 7  ( 7 )  "*" "*" "*" "*" "*" "*" " " " " "*"
## 7  ( 8 )  "*" "*" " " " " "*" "*" "*" "*" "*"
## 7  ( 9 )  "*" "*" "*" " " "*" "*" "*" " " "*"
## 7  ( 10 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 8  ( 1 )  "*" "*" "*" "*" "*" " " "*" "*" "*"
## 8  ( 2 )  "*" "*" " " "*" "*" "*" "*" "*" "*"
## 8  ( 3 )  "*" "*" "*" "*" "*" "*" " " "*" "*"
## 8  ( 4 )  "*" "*" "*" "*" "*" "*" "*" " " "*"
## 8  ( 5 )  "*" "*" "*" " " "*" "*" "*" "*" "*"
## 8  ( 6 )  "*" "*" "*" "*" "*" "*" "*" "*" " "
## 8  ( 7 )  "*" "*" "*" "*" " " "*" "*" "*" "*"
## 8  ( 8 )  "*" " " "*" "*" "*" "*" "*" "*" "*"
## 8  ( 9 )  " " "*" "*" "*" "*" "*" "*" "*" "*"
ols_step_all_possible(model_b4_full)
## # A tibble: 511 x 6
##    Index     N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
##  * <int> <int> <chr>           <dbl>           <dbl>         <dbl>
##  1     1     1 x1             0.767           0.757          2.18 
##  2     2     1 x2             0.508           0.486         26.9  
##  3     3     1 x4             0.504           0.482         27.3  
##  4     4     1 x3             0.422           0.395         35.1  
##  5     5     1 x6             0.282           0.249         48.5  
##  6     6     1 x5             0.211           0.175         55.3  
##  7     7     1 x8             0.159           0.121         60.2  
##  8     8     1 x7             0.0800          0.0382        67.7  
##  9     9     1 x9             0.0688          0.0265        68.8  
## 10    10     2 x1 x2          0.803           0.784          0.826
## # ... with 501 more rows
## Plot the adjusted R-squared for each model ##
plot(model_b4_subset, scale='adjr2')

## can use Cp, r2 or bic for scale ##
plot(model_b4_subset, scale='bic')

plot(model_b4_subset, scale='Cp')

ols_step_best_subset(model_b4_full)
##          Best Subsets Regression         
## -----------------------------------------
## Model Index    Predictors
## -----------------------------------------
##      1         x1                         
##      2         x1 x2                      
##      3         x1 x2 x9                   
##      4         x1 x2 x5 x7                
##      5         x1 x2 x3 x5 x7             
##      6         x1 x2 x4 x5 x7 x9          
##      7         x1 x2 x4 x5 x7 x8 x9       
##      8         x1 x2 x3 x4 x5 x7 x8 x9    
##      9         x1 x2 x3 x4 x5 x6 x7 x8 x9 
## -----------------------------------------
## 
##                                                    Subsets Regression Summary                                                    
## ---------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                           
## Model    R-Square    R-Square    R-Square     C(p)        AIC        SBIC        SBC        MSEP        FPE       HSP       APC  
## ---------------------------------------------------------------------------------------------------------------------------------
##   1        0.7673      0.7568      0.7291     2.1808    124.1267    56.3323    127.6609     9.5680     9.4984    0.4175    0.2750 
##   2        0.8025      0.7837      0.7293     0.8257    122.1907    55.5796    126.9029     8.9328     8.7704    0.3898    0.2539 
##   3        0.8145      0.7867       0.707     1.6857    122.6914    57.1836    128.5817     9.2752     8.9717    0.4047    0.2597 
##   4        0.8352      0.8005      0.7181     1.7095    121.8477    58.4590    128.9161     9.1543     8.6882    0.3995    0.2515 
##   5        0.8422      0.7984      0.6844     3.0405    122.8033    61.1328    131.0496     9.7955     9.0831    0.4274    0.2629 
##   6        0.8478      0.7941      0.6992     4.5051    123.9333    64.1315    133.3577    10.6276     9.5842    0.4638    0.2775 
##   7        0.8518      0.7869      0.6527     6.1320    125.3079    67.3960    135.9103    11.7349    10.2413    0.5121    0.2965 
##   8        0.8528      0.7743      0.6037     8.0297    127.1336    70.8062    138.9141    13.3142    11.1839    0.5810    0.3238 
##   9        0.8531      0.7587      0.5254    10.0000    129.0827    74.2389    142.0413    15.3300    12.3198    0.6689    0.3566 
## ---------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
# The SignifReg() function in the SignifReg package makes it easy to switch the selection criterion to the p-value or other criteria. Change the options accordingly before running it #
library(SignifReg)
# SignifReg(scope, data, alpha = 0.05, direction = "forward", criterion = "p-value", correction = "FDR") #
# Check it on your own later #



# Final models #

model_b4_1 <- lm(y ~ x1 + x2, data=table_b4)
summary(model_b4_1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = table_b4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7639 -1.9454 -0.1822  1.8068  5.0423 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.0418     2.9585   3.394  0.00273 ** 
## x1            2.7134     0.4849   5.595 1.49e-05 ***
## x2            6.1643     3.1864   1.935  0.06663 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.792 on 21 degrees of freedom
## Multiple R-squared:  0.8025, Adjusted R-squared:  0.7837 
## F-statistic: 42.67 on 2 and 21 DF,  p-value: 4.007e-08
anova(model_b4_1)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 636.16  636.16 81.6013 1.114e-08 ***
## x2         1  29.18   29.18  3.7426   0.06663 .  
## Residuals 21 163.71    7.80                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(model_b4_1)
##       x1       x2 
## 1.736466 1.736466
confint(model_b4_1, level=0.1/1) # Meant as a Bonferroni joint interval; note level=0.1 yields a 10% CI, whereas a joint 90% Bonferroni set for 2 coefficients uses level = 1 - 0.1/2 #
##                 45 %      55 %
## (Intercept) 9.665478 10.418052
## x1          2.651718  2.775079
## x2          5.758991  6.569541
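For comparison, the usual Bonferroni construction makes each individual interval wider, not narrower: with g simultaneous intervals and family level 1 - alpha, each interval is computed at level 1 - alpha/g. A sketch on the built-in mtcars data (model chosen only for illustration):

```r
# Joint 90% Bonferroni intervals for g = 2 slope coefficients.
fit   <- lm(mpg ~ disp + wt, data = mtcars)
alpha <- 0.10
g     <- 2                                  # number of simultaneous intervals
ci <- confint(fit, parm = c("disp", "wt"), level = 1 - alpha / g)
ci                                          # each interval is an individual 95% CI
```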
plot(model_b4_1, pch=16, col="blue")

# Create partial regression plots #
avPlots(model_b4_1)

model_b4_backward <- lm(y ~ x1 + x2 + x5 + x7, data=table_b4)
summary(model_b4_backward)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x5 + x7, data = table_b4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5510 -2.0807  0.0391  1.8805  3.5912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.5091     3.6331   3.718 0.001458 ** 
## x1            2.4192     0.5169   4.680 0.000163 ***
## x2            8.4802     3.2943   2.574 0.018577 *  
## x5            2.0006     1.2104   1.653 0.114793    
## x7           -2.1823     1.2762  -1.710 0.103557    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.681 on 19 degrees of freedom
## Multiple R-squared:  0.8352, Adjusted R-squared:  0.8005 
## F-statistic: 24.08 on 4 and 19 DF,  p-value: 3.249e-07
anova(model_b4_backward)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 636.16  636.16 88.4747 1.398e-08 ***
## x2         1  29.18   29.18  4.0578   0.05834 .  
## x5         1   6.08    6.08  0.8449   0.36951    
## x7         1  21.02   21.02  2.9239   0.10356    
## Residuals 19 136.61    7.19                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(model_b4_backward)
##       x1       x2       x5       x7 
## 2.138863 2.012386 1.712819 1.661124
confint(model_b4_backward, level=0.1/3) # Meant as a Bonferroni joint interval; note level=0.1/3 yields a 3.3% CI, whereas a joint 90% Bonferroni set for 4 coefficients uses level = 1 - 0.1/4 #
##                48.3 %    51.7 %
## (Intercept) 13.355217 13.662895
## x1           2.397272  2.441046
## x2           8.340759  8.619740
## x5           1.949333  2.051838
## x7          -2.236290 -2.128212
plot(model_b4_backward, pch=16, col="blue")

# Create partial regression plots #
avPlots(model_b4_backward)

deviation <- table_b4$y-mean(table_b4$y)

# Predicted R-squared = 1 - (PRESS/SST)
1-((MPV::PRESS(model_b4_full))/(deviation%*%deviation)) # Compute SST as the inner product of the deviation vector with itself #
##           [,1]
## [1,] 0.5253679
1-((MPV::PRESS(model_b4_full))/(var(table_b4$y)*(nrow(table_b4)-1)))
## [1] 0.5253679
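As a cross-check on MPV::PRESS(), the PRESS statistic can be computed directly from the ordinary residuals and the hat (leverage) values, with no refitting; a sketch on the built-in mtcars data (model chosen only for illustration):

```r
# PRESS = sum of squared deleted residuals, e_i / (1 - h_i).
fit <- lm(mpg ~ disp + wt, data = mtcars)

h     <- hatvalues(fit)                   # leverage values
press <- sum((resid(fit) / (1 - h))^2)    # PRESS statistic
sst   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

pred_r2 <- 1 - press / sst                # predicted R-squared
pred_r2
```

Because PRESS is never smaller than the SSE, the predicted R-squared is never larger than the ordinary R-squared.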

olsrr Quick Reference

Introduction

1 Regression

In the presence of interaction terms in the model, the predictors are scaled and centered before the standardized betas are computed. ols_regress() detects interaction terms automatically, but if you have created a new variable instead of using the inline * operator, you can indicate the presence of interaction terms by setting iterm to TRUE.

ols_regress(mpg ~ disp + hp + wt + qsec, data = mtcars)

2 Residual vs Fitted Values Plot

Plot to detect non-linearity, unequal error variances, and outliers.

ols_plot_resid_fit(model)

3 DFBETAs Panel

DFBETAs measure the difference in each parameter estimate with and without the influential observation. dfbetas_panel creates plots to detect influential observations using DFBETAs.

ols_plot_dfbetas(model)

4 Residual Fit Spread Plot

Plot to detect non-linearity, influential observations and outliers.

ols_plot_resid_fit_spread(model)

5 Breusch Pagan Test

The Breusch-Pagan test is used to test for heteroskedasticity (non-constant error variance). It tests whether the variance of the errors from a regression depends on the values of the independent variables. It is a χ2 test.

ols_test_breusch_pagan(model)
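The mechanics can be sketched in base R: regress the squared residuals on the fitted values and use n times the R-squared of that auxiliary regression as a χ2 statistic. This is the Koenker (studentized) variant, so the value may not match olsrr's default scaling exactly; the mtcars model is assumed only for illustration.

```r
# Studentized Breusch-Pagan statistic, auxiliary regressor = fitted values.
fit <- lm(mpg ~ disp + wt, data = mtcars)

aux     <- lm(resid(fit)^2 ~ fitted(fit))            # auxiliary regression
lm_stat <- nrow(mtcars) * summary(aux)$r.squared     # LM = n * R^2
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)
```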

6 Collinearity Diagnostics

ols_coll_diag(model)

7 Stepwise Regression

Build a regression model from a set of candidate predictor variables by entering and removing predictors based on p-values, in a stepwise manner, until no variable is left to enter or remove.

  • Variable Selection

model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)

  • Plot

8 Stepwise AIC Backward Regression

Build a regression model from a set of candidate predictor variables by removing predictors based on the Akaike Information Criterion, in a stepwise manner, until no variable is left to remove.

  • Variable Selection

k <- ols_step_backward_aic(model)

  • Plot

Variable Selection Methods

1 All Possible Regression

k <- ols_step_all_possible(model)

plot(k)

2 Best Subset Regression

Select the subset of predictors that do the best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallow’s Cp or AIC.

k <- ols_step_best_subset(model)

plot(k)

3 Stepwise Forward Regression

  • Variable Selection

k <- ols_step_forward_p(model)

plot(k)

  • Detailed Output

ols_step_forward_p(model, details = TRUE)

4 Stepwise Backward Regression

ols_step_backward_p(model, details = TRUE)

5 Stepwise Regression

ols_step_both_p(model, details = TRUE)

6 Stepwise AIC Forward Regression

ols_step_forward_aic(model, details = TRUE)

7 Stepwise AIC Backward Regression

ols_step_backward_aic(model, details = TRUE)

8 Stepwise AIC Regression

ols_step_both_aic(model, details = TRUE)

Residual Diagnostics

The standard regression assumptions about the residuals/errors are:

  • The errors have a normal distribution (normality assumption).
  • The errors have mean zero.
  • The errors have the same but unknown variance (homoscedasticity assumption).
  • The errors are independent of each other (independent errors assumption).

1 Residual QQ Plot

ols_plot_resid_qq(model)

2 Residual Normality Test

Test for detecting violation of normality assumption.

ols_test_normality(model)

Correlation between observed residuals and expected residuals under normality.

ols_test_correlation(model)

3 Residual vs Fitted Values Plot

Characteristics of a well behaved residual vs fitted plot:

  • The residuals spread randomly around the 0 line indicating that the relationship is linear.
  • The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance.
  • No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers.

ols_plot_resid_fit(model)

4 Residual Histogram

Histogram of residuals for detecting violation of normality assumption.

ols_plot_resid_hist(model)

Heteroskedasticity

1 Bartlett Test

  • Use grouping variable

ols_test_bartlett(hsb, read, group_var = female)

  • Using variables

ols_test_bartlett(hsb, read, write)

2 Breusch Pagan Test

  • Use fitted values of the model

ols_test_breusch_pagan(model)

  • Use independent variables of the model

ols_test_breusch_pagan(model, rhs = TRUE)

  • Use independent variables of the model and perform multiple tests

ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE)

  • Bonferroni p value Adjustment

ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = ‘bonferroni’)

  • Sidak p value Adjustment

ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = ‘sidak’)

  • Holm’s p value Adjustment

ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = ‘holm’)

3 Score Test

  • Use fitted values of the model

ols_test_score(model)

  • Use independent variables of the model

ols_test_score(model, rhs = TRUE)

  • Specify variables

ols_test_score(model, vars = c(‘disp’, ‘hp’))

4 F Test

  • Use fitted values of the model

ols_test_f(model)

  • Use independent variables of the model

ols_test_f(model, rhs = TRUE)

  • Specify variables

ols_test_f(model, vars = c(‘disp’, ‘hp’))

Measures of Influence

1 Cook’s D Bar Plot

Steps to compute Cook’s distance:

  • delete observations one at a time.
  • refit the regression model on remaining (n−1) observations.
  • examine how much all of the fitted values change when the ith observation is deleted.

A data point with a large Cook’s D strongly influences the fitted values.

ols_plot_cooksd_bar(model)
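The three steps above reduce to a closed form, so no refitting is actually required; a sketch on mtcars (model assumed for illustration), checked against base R's cooks.distance():

```r
# Cook's D: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2),
# with p = number of parameters (including intercept) and s^2 = MSE.
fit <- lm(mpg ~ disp + wt, data = mtcars)

h  <- hatvalues(fit)
p  <- length(coef(fit))
s2 <- summary(fit)$sigma^2
d  <- resid(fit)^2 * h / (p * s2 * (1 - h)^2)

all.equal(d, cooks.distance(fit))   # TRUE
```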

2 Cook’s D Chart

ols_plot_cooksd_chart(model)

3 DFBETAs Panel

DFBETA measures the difference in each parameter estimate with and without the influential point. There is a DFBETA for each data point; i.e., if there are n observations and k variables, there will be n × k DFBETAs. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and \(\frac{2}{\sqrt{n}}\) as a size-adjusted cutoff.

ols_plot_dfbetas(model)
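Base R computes DFBETAS directly, so the cutoffs above are easy to apply by hand; a sketch (mtcars model assumed for illustration):

```r
# Flag (observation, coefficient) pairs whose |DFBETAS| exceeds the
# size-adjusted cutoff 2 / sqrt(n).
fit <- lm(mpg ~ disp + wt, data = mtcars)

db     <- dfbetas(fit)                   # n x k matrix, one column per coefficient
cutoff <- 2 / sqrt(nrow(mtcars))
which(abs(db) > cutoff, arr.ind = TRUE)  # influential pairs
```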

4 DFFITs Plot

Steps to compute DFFITs:

  • delete observations one at a time.
  • refit the regression model on remaining observations.
  • examine how much all of the fitted values change when the ith observation is deleted.

An observation is deemed influential if the absolute value of its DFFITS value is greater than \(2\sqrt{\frac{p + 1}{n - p - 1}}\), where n is the number of observations and p is the number of predictors.

ols_plot_dffits(model)
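As with Cook's D, DFFITS has a closed form in terms of the externally studentized residuals and the leverages; a sketch on mtcars (model assumed for illustration):

```r
# DFFITS_i = rstudent_i * sqrt(h_i / (1 - h_i)); cutoff 2 * sqrt((p + 1) / (n - p - 1)).
fit <- lm(mpg ~ disp + wt, data = mtcars)

h   <- hatvalues(fit)
dff <- rstudent(fit) * sqrt(h / (1 - h))    # same values as dffits(fit)
n   <- nrow(mtcars)
p   <- 2                                    # predictors, excluding the intercept
cutoff <- 2 * sqrt((p + 1) / (n - p - 1))
names(dff)[abs(dff) > cutoff]               # flagged observations
```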

5 Studentized Residual Plot

Plot for detecting outliers. The studentized deleted residual (or externally studentized residual) is the deleted residual divided by its estimated standard deviation. Studentized residuals are more effective than standardized residuals for detecting outlying Y observations. If an observation has an externally studentized residual larger than 3 in absolute value, we can call it an outlier.

ols_plot_resid_stud(model)
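Both flavours of studentized residual are available in base R; a sketch that flags observations with |r| > 3 (mtcars model assumed for illustration):

```r
fit <- lm(mpg ~ disp + wt, data = mtcars)

r_int <- rstandard(fit)          # internally studentized (standardized) residuals
r_ext <- rstudent(fit)           # externally studentized (deleted) residuals
names(r_ext)[abs(r_ext) > 3]     # candidate outliers
```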

6 Standardized Residual Chart

Chart for detecting outliers. The standardized residual (internally studentized residual) is the residual divided by its estimated standard deviation.

ols_plot_resid_stand(model)

7 Studentized Residuals vs Leverage Plot

Graph for detecting influential observations.

ols_plot_resid_lev(model)

8 Deleted Studentized Residual vs Fitted Values Plot

ols_plot_resid_stud_fit(model)

9 Hadi Plot

Hadi’s measure of influence is based on the fact that influential observations can be present in the response variable, in the predictors, or in both. The plot is used to detect influential observations based on Hadi’s measure.

ols_plot_hadi(model)

10 Potential Residual Plot

Plot to aid in classifying unusual observations as high-leverage points, outliers, or a combination of both.

ols_plot_resid_pot(model)

Collinearity, Model Fit & Variable Contribution

1 Collinearity Diagnostics

  • VIF

ols_vif_tol(model)

  • Tolerance

Percent of variance in the predictor that cannot be accounted for by other predictors.

  • Condition Index

Most multivariate statistical approaches involve decomposing a correlation matrix into linear combinations of variables. The linear combinations are chosen so that the first combination has the largest possible variance (subject to some restrictions we won’t discuss), the second combination has the next largest variance, subject to being uncorrelated with the first, the third has the next largest variance, subject to being uncorrelated with the first and second, and so forth. The variance of each of these linear combinations is called an eigenvalue. Collinearity is spotted by finding two or more variables that have large proportions of variance (0.50 or more) corresponding to large condition indices. A rule of thumb is to label as large those condition indices of about 30 or more.

ols_eigen_cindex(model)

  • Collinearity Diagnostics

ols_coll_diag(model)
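Both the tolerance/VIF definitions and the condition indices above can be reproduced in base R. A sketch on a hypothetical mtcars model (mpg ~ disp + wt + hp, chosen only for illustration; olsrr's exact scaling of the condition indices may differ slightly):

```r
fit <- lm(mpg ~ disp + wt + hp, data = mtcars)

# Tolerance and VIF for disp: regress it on the other predictors.
aux      <- lm(disp ~ wt + hp, data = mtcars)
tol      <- 1 - summary(aux)$r.squared      # tolerance = unexplained share of disp
vif_disp <- 1 / tol                         # variance inflation factor
c(tolerance = tol, vif = vif_disp)

# Condition indices: eigenvalues of X'X after scaling columns to unit length.
X  <- model.matrix(fit)
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
ev <- eigen(crossprod(Xs))$values
ci <- sqrt(max(ev) / ev)    # indices of roughly 30 or more signal collinearity
ci
```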

2 Model Fit Assessment

  • Residual Fit Spread Plot

ols_plot_resid_fit_spread(model)

  • Part & Partial Correlations

Correlations: Relative importance of independent variables in determining Y. How much each variable uniquely contributes to R2 over and above that which can be accounted for by the other predictors.

Zero Order: Pearson correlation coefficient between the dependent variable and the independent variables.

Part: Unique contribution of independent variables. How much R2 will decrease if that variable is removed from the model?

Partial: How much of the variance in Y, which is not estimated by the other independent variables in the model, is estimated by the specific variable?

ols_correlations(model)

  • Observed vs Predicted Plot

ols_plot_obs_fit(model)

  • Lack of Fit F Test

ols_pure_error_anova(model)

  • Diagnostics Panel

ols_plot_diagnostics(model)

3 Variable Contributions

  • Residual vs Regressor Plots

Graph to determine whether a new predictor should be added to a model that already contains other predictors. The residuals from the model are regressed on the new predictor; if the plot shows a non-random pattern, you should consider adding the new predictor to the model.

ols_plot_resid_regressor(model, drat)

  • Added Variable Plot

Regress Y on all variables other than X and store the residuals (Y residuals).
Regress X on all the other variables included in the model (X residuals).
Construct a scatter plot of Y residuals and X residuals.

ols_plot_added_variable(model)
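The three steps above can be reproduced without olsrr; a sketch building the added-variable plot for wt in a hypothetical mtcars model:

```r
fit_y <- lm(mpg ~ disp + hp, data = mtcars)   # Y on all variables other than wt
fit_x <- lm(wt  ~ disp + hp, data = mtcars)   # wt on the other predictors
plot(resid(fit_x), resid(fit_y),
     xlab = "wt residuals", ylab = "mpg residuals",
     main = "Added-variable plot for wt", pch = 16)
abline(lm(resid(fit_y) ~ resid(fit_x)))       # fitted line through the residual cloud
```

By the Frisch-Waugh theorem, the slope of the fitted line equals the coefficient of wt in the full model mpg ~ disp + hp + wt.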

  • Residual Plus Component Plot

Regress Y on all variables including X and store the residuals (e). Add to e the component bX, where b is the regression coefficient of X. Construct a scatter plot of e + bX against X.

ols_plot_comp_plus_resid(model)